GitBucket
4.21.2
Toggle navigation
Snippets
Sign in
Files
Branches
1
Releases
Wiki
nigel.stanger
/
Wiki
Compare Revisions
View Page
Back to Page History
Transcribing lectures using Whisper.md
Recording quality is critical. Poor recordings with dropouts can lead to confused transcript timings requiring significant post-editing. Whisper can often end up creating very short text chunks, leading to a kind “rapid fire” effect. Assuming a good quality recording, the following settings seem to do a good job: * Medium model (specifically `medium.en`). This performs better than the large model when transcribing English because the `large` doesn’t have an English-specific model. * Enable word timestamps. * Explicitly set the maximum number of words per line. 20 seems slightly too short, but realistically any value is probably going to cause some awkward “text breaks”. * VTT output. EchoVideo supports other formats, but VTT is supported by everything and is easy to work with. (Needs a better VS Code extension, though.) * Disable FP16 if running on Apple M1 or M2 as they don’t have support for it. Should be fine on everything else. * An initial prompt may improve accuracy, e.g., “This is a postgraduate lecture about ethical issues in big data. The main topics are ethics, law, privacy, data dredging, and statistics.” For example: ```sh whisper --model medium.en --language English --output_format vtt --fp16 False --word_timestamps True --max_words_per_line 30 --initial_prompt "<prompt>" <input-file> ``` Models are downloaded to `~/.cache/whisper`. **Vibe** seems to be a useful cross-platform GUI implementation. Implemented in Rust, optimised for Apple Silicon, and supports Core ML models, so much faster than vanilla Whisper (written in Python and not optimised). Vibe claims to have a CLI, but I can’t figure out how to make it work. VS Code extension issues: * Missing feature: merge subtitles. Merges selected subtitles into one and adjusts timestamps accordingly. * Bug: Sometimes adjusting timing leads to timestamps like `00:40:48.1000`, which should actually be `00:40:49.000`. Clearly there is something slightly wonky with the arithmetic.
Recording quality is critical. Poor recordings with dropouts can lead to confused transcript timings requiring significant post-editing. Whisper can often end up creating very short text chunks, leading to a kind “rapid fire” effect. Assuming a good quality recording, the following settings seem to do a good job: * Medium model. This performs better than the large model when transcribing English because the large model isn’t English-specific. * Enable word timestamps. * Explicitly set the maximum number of words per line. 20 seems slightly too short, but realistically any value is probably going to cause some awkward “text breaks”. * VTT output. EchoVideo supports other formats, but VTT is supported by everything and is easy to work with. (Needs a better VS Code extension, though.) * Disable FP16 if running on Apple M1 or M2 as they don’t have support for it. Should be fine on everything else. * An initial prompt may improve accuracy, e.g., “This is a postgraduate lecture about ethical issues in big data. The main topics are ethics, law, privacy, data dredging, and statistics.” For example: ```sh whisper --model medium --language English --output_format vtt --fp16 False --word_timestamps True --max_words_per_line 30 --initial_prompt "<prompt>" <input-file> ``` VS Code extension issues: * Missing feature: merge subtitles. Merges selected subtitles into one and adjusts timestamps accordingly. * Bug: Sometimes adjusting timing leads to timestamps like `00:40:48.1000`, which should actually be `00:40:49.000`. Clearly there is something slightly wonky with the arithmetic.